An analysis of almost any social media data can be rather telling of how subgroups of a population interact with each other on a large scale. We are interested in the content of these interactions and how they vary throughout the United States over the few days that our data spans.
Ever wanted to know what everyone’s been tweeting about? Well, thanks to Twitter’s hashtag system, that’s already possible. But how about the most popular places everyone’s been tweeting from? Or a simplified way to see how all those Twitter users are feeling? Thanks to some in-depth exploratory analyses from Dr. Jeff Goldsmith’s (Columbia University Mailman School of Public Health) Team Awesome™, and courtesy of Followthehashtag’s publicly available Twitter data, even this is possible.
In a rapidly changing and increasingly tech-based world, people now have the power to react to global events happening thousands of miles away in real time. Social media as a whole, and Twitter especially, is one of the biggest domains for capturing these reactions. Our team’s motivation for this analysis comes from a desire to aggregate these reactions in as compact and sensible a format as possible.
What questions are you trying to answer? How did these questions evolve over the course of the project? What new questions did you consider in the course of your analysis?
Our initial ideas for analysis were:
Identify different languages
Which accounts get the most traffic
Geolocation traffic
Any events that happened during the time period, and their influence
Percent of tweets that use emojis or hashtags
Positive/negative words, correlation with time?
Percent of tweets devoted to a given topic: political, food, hobbies/lifestyle, etc.
Correlation between location and sentiment
Advertising percentage, success
Content analysis: top tweets & retweets
Correlation of tweets and events of days covered
As we looked at the dataset and analyzed the variables available to us, we found that some of these questions were outside of the scope of the information or tools available to us.
We decided to focus on just a few areas:
Positive/negative words
Correlation with time
Correlation between location and sentiment
The dataset we used from Followthehashtag is a large (though not exhaustive) collection of 200,000 tweets posted from April 14, 2016 to April 16, 2016 by users across the United States (and outside the U.S., but we focused on domestic tweets). It comes as an easy-to-access csv file within a zipped folder. For each tweet, user information such as name, location (latitude/longitude), and number of followers is given, along with the entire content of the tweet itself.
We chose this dataset mainly because it was already in a usable form and needed minimal cleaning. It included the variables we were interested in, such as tweet content for sentiment analysis and location data for mapping.
# The sentiment function takes a really long time so I created a new data file so it didn't have to be run each time
# The code to create the data file used throughout our analysis is found below
# tweets <-
# read_excel("~/USA-Geolocated-tweets-free-dataset-Followthehashtag/dashboard_x_usa_x_filter_nativeretweets.xlsx", sheet = "Stream") %>%
# janitor::clean_names() %>%
# rename(tweet_language = tweet_language_iso_639_1) %>%
# filter(tweet_language == "en") %>%
# select(-tweet_language)
#
# mySentiment <- get_nrc_sentiment(tweets$tweet_content)
# us_tweets <- cbind(tweets, mySentiment)
# write.csv(us_tweets, file = "us_tweets.csv")
us_tweets <- read_csv("us_tweets.csv")
#gets rid of non alphabetic characters
us_tweets$tweet_content_stripped <- gsub("[^[:alpha:] ]", "", us_tweets$tweet_content)
#removes all words that are 1-2 letters long
us_tweets$tweet_content_stripped <- gsub(" *\\b[[:alpha:]]{1,2}\\b *", " ", us_tweets$tweet_content_stripped)
#removes "amp", abbreviation for ampersand
us_tweets$tweet_content_stripped <- gsub(" amp ", " ", us_tweets$tweet_content_stripped)
#recodes "jobs" and "Jobs" to "job" so job-related words are counted together
us_tweets$tweet_content_stripped <- gsub("jobs", " job ", us_tweets$tweet_content_stripped)
us_tweets$tweet_content_stripped <- gsub("Jobs", " job ", us_tweets$tweet_content_stripped)
We used the Syuzhet package from GitHub (thank you Matthew Jockers!) to extract sentiments from tweet content. Our primary analyses consisted of mapping these tweets (using tweet location) as observable sentiments across the United States, which gives a nice aggregate picture of how the U.S. twitterverse was feeling during the dates mentioned above.
About the Sentiment function: Matthew Jockers’s sentiment function is essentially a dictionary that assigns different words to different sentiments. The general sentiments he uses in this function (and subsequently the ones we use in our analyses) are trust, joy, anger, sadness, fear, disgust, anticipation, and surprise. While some of these sentiments may not seem intuitive to use, altogether they form a relatively broad spectrum of moods and emotions which make for interesting analyses.
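To make the dictionary behavior concrete, here is a minimal sketch of calling get_nrc_sentiment() on a single made-up sentence (assuming the syuzhet package is loaded, as in the commented preprocessing code above); the example sentence is invented for illustration.
library(syuzhet)
#score one example sentence against the NRC sentiment dictionary
get_nrc_sentiment("So happy and excited to start the new job on Monday!")
#returns one row of counts for anger, anticipation, disgust, fear, joy,
#sadness, surprise, and trust, plus negative and positive totals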
Visualizations, summaries, and exploratory statistical analyses. Justify the steps you took, and show any major changes to your ideas.
Our additional shiny repos can be found: here for all US, here for individual states, and here for overall tweet analysis
What were your findings? Are they what you expect? What insights into the data can you make?
The ‘overall analysis’ shows summary information on the tweets overall, not divided by state or sentiment, in order to give a better understanding of the dataset as a whole. Initial data cleaning for text mining included creating a new variable called tweet_content_stripped, which took the initial tweet content and used pattern matching to remove non-alphabetic characters (i.e., numbers and symbols) and words that were only one or two letters long. Additionally, we subsetted the data to include only tweets written in English.
The resulting Shiny app uses a subset of the data, only 100,000 tweets, since the Shiny app could not host such a large dataset. The two visualizations in the tabset show the most frequent words overall in tweet_content_stripped. Note that after the plots were first made, “jobs” and “amp” appeared among the top words, so these were later recoded in tweet_content_stripped. Since ‘job’ already appeared in the most frequent words, ‘jobs’ was recoded to ‘job’ to avoid redundancy, and thus the top word in the bar chart includes counts for both ‘job’ and ‘jobs’. ‘Amp’ was removed from tweet_content_stripped, since we determined it was an abbreviation for ampersand and thus not meaningful for our data.
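As a rough illustration of that subsetting step (not the exact code behind the app), a 100,000-tweet sample could be drawn as follows; the seed and the object name shiny_tweets are placeholders.
#hedged sketch: draw a 100,000-tweet subset for the Shiny app
set.seed(1)
shiny_tweets <- us_tweets %>% sample_n(100000)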
The output below shows total counts for all sentiments found in tweets in the dataset. Each tweet might have more than one sentiment.
sentimentTotals <- data.frame(colSums(us_tweets[,c(20:27)]))
names(sentimentTotals) <- "count"
sentimentTotals <- cbind("sentiment" = rownames(sentimentTotals),
sentimentTotals)
sentimentTotals
## sentiment count
## anger anger 13605
## anticipation anticipation 52960
## disgust disgust 12668
## fear fear 19942
## joy joy 46690
## sadness sadness 21882
## surprise surprise 22067
## trust trust 76347
us_tweets_long <- gather(us_tweets, sentiment, count, anger:trust,
factor_key = TRUE)
Number of tweets per hour (below) is a time series plot showing the distribution of tweets at different hours of the day. The dataset covers a 48 hour span, and we found it interesting that some time intervals are missing at the same times on both days. This could be related to the way the data was collected, but the information provided with the data does not mention why these gaps occur.
us_tweets$hour <- as.POSIXct(us_tweets$hour, format = " %H:%M")
#generates plot of distribution of plots across time
tweets_over_time <-
ggplot(data = us_tweets, aes(x = hour)) +
geom_histogram(stat = "count") +
xlab("Time") + ylab("Number of Tweets") +
ggtitle("Number of Tweets per Hour") +
scale_x_datetime(labels = date_format("%H:%M"), breaks = pretty_breaks(n = 10))
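A quick way to see those gaps is to tabulate tweets by hour-of-day label; this is a hedged sketch, and hour_label is a helper column added only for the check.
#count tweets per hour label to spot hours with no recorded tweets
us_tweets %>%
  mutate(hour_label = format(hour, "%H:%M")) %>%
  count(hour_label) %>%
  arrange(hour_label)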
This plot was part of our exploratory data analysis. It shows the number of characters per tweet. Although interesting, we did not feel it was relevant to our final analyses, so it was not included on the website.
us_tweets$charsintweet <- nchar(us_tweets$tweet_content)
ggplot(data = us_tweets, aes(x = charsintweet)) +
geom_histogram(aes(fill = ..count..), binwidth = 8) +
theme(legend.position = "none") +
xlab("Characters per Tweet") +
ylab("Number of tweets") +
scale_fill_gradient(low = "midnightblue", high = "aquamarine4") +
xlim(0,150) +
ggtitle("Characters per Tweet")
The sentiments plot below gives a general understanding of the most common sentiments throughout the tweets. When comparing states, it is helpful to see the overall distribution of sentiments so that we can recognize states with contrasting distributions.
ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) +
geom_bar(aes(fill = sentiment), stat = "identity") +
theme(legend.position = "none") +
xlab("Sentiment") +
ylab("Total Count") +
ggtitle("Total Sentiment Score for All Tweets in Sample")
The interactive word cloud in our Shiny app does not render in the Rmd, so the word cloud above is not the same as the one on our website; we describe the website version here. The word cloud was generated using a subset in which only words appearing more than 200 times are included. The website has a zoom slider that allows you to zoom in and out to see the most and least frequent words (of those appearing more than 200 times). Word size corresponds to frequency; zooming out to 0.5 shows the most frequent words used in the tweets, but because frequent words take up so much space, less common words are hard to see. Hovering the cursor over a word shows its frequency among all the tweets. The word cloud spaces and colors the words differently each time it is loaded, but word sizes remain the same for a given zoom level.
tweet_words <- us_tweets %>%
unnest_tokens(word, tweet_content_stripped)
data(stop_words)
tweet_words <-
anti_join(tweet_words, stop_words)
tweet_words %>%
count(word) %>%
with(wordcloud(word, n, max.words = 200,
random.order = FALSE,
rot.per = 0.35,
colors = brewer.pal(2, "Dark2")))
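For reference, a hedged sketch of an interactive cloud similar to the one on our website could use the wordcloud2 package (our Shiny app's actual implementation may differ); the size argument here loosely plays the role of the zoom slider, and word_freqs is built from the same token counts.
#hedged sketch of an interactive word cloud with hover frequencies
library(wordcloud2)
word_freqs <- tweet_words %>%
  count(word, sort = TRUE) %>%
  filter(n > 200)
wordcloud2(word_freqs, size = 0.5)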
The bar chart provides a simpler representation of common words, showing the top 10 most frequent words. These top 10 words, starting with the most frequent, are job, hiring, careerarc, click, retail, recommend, fit, hospitality, apply, and sales.
pal2 <- brewer.pal(8,"Dark2")
#creates bar chart of 10 most frequent words
top_words <-
tweet_words %>%
count(word, sort = TRUE) %>%
top_n(10) %>%
mutate(word = fct_reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_bar(stat = "identity", fill = "blue", alpha = .6) +
coord_flip() +
labs(title = "10 Most Frequent Words", y = "Count", x = "Word")
hashtags <- str_extract_all(us_tweets$tweet_content, "#\\S+")
hashtags <- unlist(hashtags)
hashtags <- gsub("[^[:alnum:] ]", "", hashtags)
hashtags <- tolower(hashtags)
hashtag.df <- data.frame(table(hashtags))
hashtag.df$hashtags <- as.character(hashtag.df$hashtags)
hashtag.df$Freq <- as.numeric(as.character(hashtag.df$Freq))
hashtag.df <- arrange(hashtag.df, desc(Freq))
print(hashtag.df[1:20,])
## hashtags Freq
## 1 job 51511
## 2 hiring 45428
## 3 jobs 21910
## 4 careerarc 20717
## 5 retail 7454
## 6 hospitality 7311
## 7 nursing 5091
## 8 healthcare 4702
## 9 veterans 4471
## 10 sales 3310
## 11 it 2179
## 12 customerservice 1927
## 13 transportation 1568
## 14 sonic 1520
## 15 manufacturing 1476
## 16 photo 1432
## 17 businessmgmt 1348
## 18 accounting 1053
## 19 engineering 970
## 20 traffic 955
A shiny repo for this analysis can be found: here
When mapping the positive scores for all tweets, we see moderate to low scores throughout the US. At this scale, we cannot see a definitive trend at the state level. However, we do see that relatively few tweets were generated in the Midwest or Northwest. There do seem to be slightly more positive tweets from the middle of the country.
#positive tweets, ggplot
us_tweets %>%
filter(country == "US") %>%
ggplot(aes(x = longitude, y = latitude, color = positive)) +
geom_point(alpha = .6) +
scale_colour_gradientn(colours = rainbow(10)) +
ggtitle("Positive Tweets")
When mapping sentiment across the US, we see an overwhelming number of “trust” tweets. We are not quite sure what this emotion means. We found that most tweets including “job” or “jobs” mapped to the emotion “trust.” There are many tweets with those words, so it may be interesting to filter out that emotion.
#name of sentiment, ggplot
us_tweets_long %>%
filter(country == "US") %>%
filter(count > 0) %>%
ggplot(aes(x = longitude, y = latitude, color = factor(sentiment))) +
geom_point(alpha = .6)+
ggtitle("Tweet Sentiments") +
scale_color_discrete(name="Sentiment")
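Before filtering it out, a quick hedged check of the job/trust overlap noted above could look like the sketch below; mentions_job is a hypothetical helper column, not part of our original pipeline.
#compare average trust scores for tweets that do and do not mention "job"
us_tweets %>%
  mutate(mentions_job = str_detect(tolower(tweet_content), "job")) %>%
  group_by(mentions_job) %>%
  summarise(mean_trust = mean(trust), n_tweets = n())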
When we filter out trust, we see that surprise and joy seem to be commonly tweeted emotions.
#name of sentiment, ggplot
us_tweets_long %>%
filter(country == "US") %>%
filter(count > 0) %>%
filter(sentiment != "trust") %>%
ggplot(aes(x = longitude, y = latitude, color = factor(sentiment))) +
geom_point(alpha = .6)+
ggtitle("Tweet Sentiments") +
scale_color_discrete(name="Sentiment")
Because the location column varies in specificity, we built a function that takes the latitude and longitude of each tweet and converts them to the state from which the tweet originated. We then added the resulting state name to our original dataset.
state_tweets = us_tweets %>%
select("longitude", "latitude")
latlong2state <- function(state_tweets) {
  #get lower-48 state polygons from the maps package
  states <- map('state', fill = TRUE, col = "transparent", plot = FALSE)
  IDs <- sapply(strsplit(states$names, ":"), function(x) x[1])
  #convert polygons and tweet coordinates to sp objects on the same projection
  states_sp <- map2SpatialPolygons(states, IDs = IDs,
                                   proj4string = CRS("+proj=longlat +datum=WGS84"))
  states_tweets_SP <- SpatialPoints(state_tweets,
                                    proj4string = CRS("+proj=longlat +datum=WGS84"))
  #match each point to the state polygon it falls inside (NA if none)
  indices <- over(states_tweets_SP, states_sp)
  stateNames <- sapply(states_sp@polygons, function(x) x@ID)
  stateNames[indices]
}
state_name = latlong2state(state_tweets)
us_tweets = cbind(state_name, us_tweets)
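As a quick hedged sanity check, coordinates falling outside the mapped lower-48 polygons come back from over() as NA, which is why missing state names are dropped in the next step.
#tally how many tweets could and could not be matched to a state
table(is.na(us_tweets$state_name))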
To evaluate overall sentiment by state, we selected the appropriate columns, then grouped and summed by state, making sure to drop missing locations. Maine, Alaska, and Hawaii were not included in this dataset; the count of 48 regions comes from Virginia and the District of Columbia receiving individual designations.
us_sentiments = us_tweets %>%
filter(country == "US") %>%
select(c(1, 21:30)) %>%
na.omit(state_name) %>%
group_by(state_name) %>%
summarise_all(funs(sum)) %>%
mutate(positive = as.numeric(positive),
negative = as.numeric(negative))
The following heatmap shows the level of positive and negative sentiment across the United States during the 48 hour period of our dataset. Maine, Alaska and Hawaii are blacked out as tweets from those states were not recorded.
We can observe from these two maps that states like California and Texas are consistently the highest ranked, which can be assumed to be population related. It is interesting that the state with the lowest positive and negative sentiment scores is Washington. This could be for two reasons: population difference, or tweets from Washington containing fewer sentimental words than tweets from other states and therefore not generating as strong sentiment scores.
us_sentiments %>%
select("state_name", "negative") %>%
rename(region = state_name, value = negative) %>%
state_choropleth(title = "Negative Sentiment across the U.S.",
legend = "Sentiment Score")
us_sentiments %>%
select("state_name", "positive") %>%
rename(region = state_name, value = positive) %>%
state_choropleth(title = "Positive Sentiment Across the U.S.",
legend = "Sentiment Score")
A shiny repo for the individual state analyses can be found: here